The accuracy and quality of image-based artificial intelligence for muscle-invasive bladder cancer prediction

Purpose (1) To evaluate the diagnostic performance of image-based artificial intelligence (AI) studies in predicting muscle-invasive bladder cancer (MIBC). (2) To assess the reporting quality and methodological quality of these studies using the Checklist for Artificial Intelligence in Medical Imaging (CLAIM), the Radiomics Quality Score (RQS), and the Prediction model Risk of Bias Assessment Tool (PROBAST).

Materials and methods We searched the Medline, Embase, Web of Science, and Cochrane Library databases up to October 30, 2023. Eligible studies were evaluated with CLAIM, RQS, and PROBAST. Pooled sensitivity, specificity, and the diagnostic performance of these models for MIBC were also calculated.

Results Twenty-one studies comprising 4256 patients were included, of which 17 were used in the quantitative analysis. The CLAIM study adherence rate ranged from 52.5% to 75%, with a median of 64.1%. The RQS of each study ranged from 2.78% to 50% of the maximum points, with a median of 30.56%. All models were rated as having a high overall risk of bias. The pooled area under the curve was 0.85 (95% confidence interval (CI) 0.81–0.88) for computed tomography, 0.92 (95% CI 0.89–0.94) for MRI, 0.89 (95% CI 0.86–0.92) for radiomics, and 0.91 (95% CI 0.88–0.93) for deep learning.

Conclusion Although AI-powered models for predicting muscle-invasive bladder cancer showed promising performance in the meta-analysis, their reporting quality and methodological quality were generally low, with a high risk of bias.

Critical relevance statement Artificial intelligence might improve the management of patients with bladder cancer. Multiple models for muscle-invasive bladder cancer prediction have been developed. Quality assessment is needed to promote clinical application.

Key Points Image-based artificial intelligence models could aid in the identification of muscle-invasive bladder cancer. Current studies had low reporting quality, low methodological quality, and a high risk of bias. Future studies could focus on larger sample sizes and more transparent reporting of pathological evaluation, model explanation, and failure and sensitivity analyses.


Introduction
Bladder cancer (BCa) constitutes a significant global health challenge, with an estimated 550,000 new cases and 200,000 deaths worldwide annually [1]. Muscle-invasive bladder cancer (MIBC) is a particularly aggressive form of BCa, defined by invasion of the tumor into or beyond the superficial muscularis propria of the bladder wall [2]. This subtype is characterized by higher mortality rates, earlier metastasis, and a worse prognosis compared with non-muscle-invasive bladder cancer (NMIBC) [3, 4]. Identifying MIBC promptly is crucial, as it necessitates more aggressive treatments, including radical cystectomy (RC) and adjuvant therapy, which are critical for improving patient outcomes [4, 5].
Clinically, cystoscopy with transurethral resection of bladder tumor (TURBT) is usually the diagnostic approach for identifying MIBC in patients with suspected BCa. While effective, this invasive procedure can occasionally under-sample muscular tissue, resulting in false-negative rates of approximately 10% to 15% [6]. The Vesical Imaging-Reporting and Data System (VI-RADS), based on multiparametric magnetic resonance imaging (MRI), has emerged as a valuable non-invasive alternative, offering high sensitivity and specificity in differentiating MIBC from NMIBC [7, 8]. However, the utility of VI-RADS is limited by long acquisition times, the high cost of MRI examinations, and dependence on the subjective experience of the interpreting radiologist.
Recent advancements in artificial intelligence (AI), particularly in radiomics and deep learning (DL), provide a promising avenue for the pre-operative identification of MIBC. AI techniques can analyze medical images by extracting hand-crafted radiomic features or using self-learned DL features to predict disease status through sophisticated classification algorithms [9-11]. These technologies have shown potential in enhancing the accuracy and efficiency of MIBC diagnosis. Despite this promise, there is wide variation in the reported results across studies [33, 34]. Furthermore, the overall quality of these studies has not been thoroughly assessed, especially concerning critical methodological aspects such as patient selection, model development, and performance evaluation [33, 34], which hinders the clinical application of AI techniques in identifying MIBC.
In this study, we aimed to (1) systematically review and evaluate the diagnostic performance of current AI studies on the prediction of MIBC, and (2) use the Checklist for Artificial Intelligence in Medical Imaging (CLAIM), the Radiomics Quality Score (RQS), and the Prediction model Risk of Bias Assessment Tool (PROBAST) to comprehensively assess the reporting quality and methodological quality of these models [35-37].

Literature search strategy and study selection
The study protocol is registered in the International Prospective Register of Systematic Reviews (CRD42023446035). This systematic review was conducted according to the recommendations of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) for Diagnostic Test Accuracy statement [38]. The PRISMA checklist is provided in Supplementary Material 1.
Medline via PubMed, Embase via Ovid, Web of Science, and The Cochrane Library were searched for eligible studies from inception to October 30, 2023, using a combination of the following terms: "bladder cancer", "muscle invasion or staging", "radiomics or deep learning", and "computed tomography or magnetic resonance imaging or ultrasound". The language was restricted to English. The detailed search queries are provided in Supplementary Material 2. In addition, we screened the bibliographies of the initially retrieved articles for additional relevant studies.
After removal of duplicate studies, two researchers with 2 and 7 years of experience in genitourinary imaging screened the titles and abstracts of the identified studies. Studies were excluded if the article was a review, editorial, conference abstract, or case report. The remaining studies underwent full-text assessment. To be included, articles had to fulfill the following criteria: (1) population: patients with primary BCa; (2) index test: development or validation of radiomics or DL models using computed tomography (CT), MRI, or ultrasound images; (3) outcome: muscle-invasion status confirmed by at least one pathological evaluation method; and (4) original research articles. Studies were excluded from the meta-analysis if they lacked sufficient data to reconstruct the 2 × 2 contingency table.

Data extraction
Relevant data were extracted from each eligible publication using a standardized form recording the following information: study year, data collection strategy, number of centers, target population, prediction level, sample size, MIBC ratio, gold standard, internal validation method, external validation method, modality, annotation method, number of readers per case, reader agreement, feature extraction method, number of extracted features, number of selected features, and final classifier algorithm. For DL models, the feature number was defined as the number of neurons in the first fully connected layer, since the convolutional layers were considered feature extractors. The following diagnostic accuracy measures were also recorded for the meta-analysis: true positives, true negatives, false positives, and false negatives. When a study involved training and test cohorts, the diagnostic performance in the test cohort was used to represent the model's predictive power; when a study involved internal and external cohorts, the performance in the external cohort was used. If several prediction models were developed in one study, the model with the best performance was chosen.
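As a brief illustration of how the recorded 2 × 2 counts translate into the accuracy measures pooled later, the sketch below (with hypothetical counts, not taken from any included study) derives sensitivity, specificity, and likelihood ratios from the true/false positives and negatives.

```python
def diagnostic_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy measures recoverable from a 2 x 2 contingency table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_likelihood_ratio": sensitivity / (1 - specificity),
        "negative_likelihood_ratio": (1 - sensitivity) / specificity,
    }

# Hypothetical test-cohort counts for a single study
print(diagnostic_measures(tp=42, fp=8, fn=6, tn=44))
```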

CLAIM, RQS, and PROBAST evaluation
The same two researchers independently assessed all eligible publications with CLAIM, RQS, and PROBAST. When a discrepancy occurred, agreement was reached after discussion with two senior researchers, and the consensus data were used in the subsequent analyses. For CLAIM, study reporting was evaluated against a total of 42 items. The item adherence rate was the percentage of adhering studies among all studies to which the item was applicable, while the study adherence rate was the percentage of adhered items among all items applicable to the study. The RQS, which consists of 16 criteria, is a widely accepted tool for measuring the methodological rigor of the radiomics workflow. The total RQS is the sum of the points from checkpoint 1, checkpoint 2, and checkpoint 3, with the ideal score being 100% (36/36 points). For PROBAST, the risk of bias (ROB) was assessed across four domains: participants, predictors, outcome, and analysis. Signaling questions within each domain were answered with one of five options: "yes", "probably yes", "probably no", "no", or "no information". If any signaling question was answered "no" or "probably no", the domain was labeled as high ROB. If all signaling questions were answered "yes" or "probably yes", the domain was labeled as low ROB. If there was no "no"/"probably no" answer but at least one "no information", the domain was labeled as unclear ROB. The overall ROB across the four domains was then determined using the same criteria.
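For illustration, the sketch below encodes the two scoring rules described above: the CLAIM study adherence rate (adhered items over applicable items) and the PROBAST domain rating derived from the signaling-question answers. The function names and data layout are illustrative assumptions, not part of the original evaluation.

```python
def study_adherence_rate(item_answers: dict[str, bool | None]) -> float:
    """CLAIM study adherence rate: adhered items / applicable items (None = not applicable)."""
    applicable = [v for v in item_answers.values() if v is not None]
    return sum(applicable) / len(applicable)

def rate_domain(answers: list[str]) -> str:
    """Map PROBAST signaling-question answers to a domain-level ROB rating."""
    negatives = {"no", "probably no"}
    positives = {"yes", "probably yes"}
    if any(a in negatives for a in answers):
        return "high ROB"
    if all(a in positives for a in answers):
        return "low ROB"
    # No negative answers, but at least one "no information"
    return "unclear ROB"

print(study_adherence_rate({"item1": True, "item2": False, "item3": None}))  # -> 0.5
print(rate_domain(["yes", "probably yes", "no information"]))                # -> "unclear ROB"
```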

Statistical analysis
Heterogeneity was evaluated using (1) the Cochran Q test, with a p-value of < 0.05 indicating significant heterogeneity, and (2) the Higgins I² statistic, with I² values of 0-25%, 25-50%, 50-75%, and > 75% representing very low, low, medium, and high heterogeneity, respectively. The weight of each study was calculated with the inverse-variance method, in which the weight given to each study is the inverse of the variance of its effect estimate, minimizing the uncertainty of the pooled effect estimate. In the case of medium or high heterogeneity, the random-effects model was favored over the fixed-effect model. Diagnostic accuracy was assessed using hierarchical summary receiver operating characteristic (HSROC) curves and the area under the HSROC curve (AUC). Sensitivity, specificity, positive likelihood ratio, and negative likelihood ratio were also calculated. Meta-regression was performed to explore potential sources of heterogeneity.
All calculations were performed with a 95% confidence interval (95% CI).A difference was considered statistically significant when the p-value was smaller than 0.05.We used the "metandi" and "midas" modules in Stata 17 for statistical analyses [39,40].
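To make the inverse-variance weighting and heterogeneity statistics concrete, the following sketch pools study-level sensitivities on the logit scale and reports Cochran Q and I². It is a deliberately simplified univariate, fixed-effect illustration with made-up counts; the actual meta-analysis used the hierarchical bivariate/HSROC models implemented in the Stata modules metandi and midas.

```python
import numpy as np

def pool_logit(events: np.ndarray, totals: np.ndarray):
    """Inverse-variance pooling of proportions (e.g., sensitivity) on the logit scale."""
    p = events / totals
    logit = np.log(p / (1 - p))
    var = 1 / events + 1 / (totals - events)      # approximate variance of each logit
    w = 1 / var                                    # inverse-variance weights
    pooled = np.sum(w * logit) / np.sum(w)
    se = np.sqrt(1 / np.sum(w))
    ci = pooled + np.array([-1.96, 1.96]) * se     # 95% CI on the logit scale

    q = np.sum(w * (logit - pooled) ** 2)          # Cochran Q statistic
    i2 = max(0.0, (q - (len(events) - 1)) / q)     # Higgins I^2

    expit = lambda x: 1 / (1 + np.exp(-x))
    return expit(pooled), expit(ci), q, i2

# Hypothetical per-study true positives and numbers of MIBC-positive patients
sens, ci, q, i2 = pool_logit(np.array([40.0, 55.0, 30.0]), np.array([50.0, 60.0, 35.0]))
print(f"Pooled sensitivity {sens:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f}), Q={q:.2f}, I2={i2:.0%}")
```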

Characteristics of included studies
A flowchart depicting the study selection process is shown in Fig. 1. The search strategy identified 171 studies after removing duplicates. Among these, 21 studies met the inclusion criteria. The included studies are summarized in Table 1 and Fig. 2.

AI technique details of the included studies
The AI technique details of the included studies are summarized in Table 2. All the studies analyzed a single modality. The most common modality was MRI (n = 12), followed by CT (n = 8) and ultrasound (n = 1). Manual annotation (n = 15) was the major method for delineating the region of interest (ROI), while the other studies used semi-automatic (n = 4) or fully automatic (n = 2) segmentation algorithms. Fourteen studies extracted hand-crafted radiomic features, five studies extracted self-learned DL features [13, 19, 21, 24, 29, 32], and two studies extracted both types of features [23, 30]. Extracted feature counts ranged from 63 to 23,688 in the included studies, and selected feature counts ranged from 6 to 2048.

Quality evaluation
The CLAIM study adherence rate, RQS points, and number of "yes" or "probably yes" answers in PROBAST for each study are shown in Fig. 3.

RQS
The RQS points of each study are shown in Fig. 5, and the median RQS points for each criterion are displayed in Table 4. The RQS of each study ranged from 2.78% to 50% of the maximum points, with a median of 30.56%. The detailed RQS evaluation and checklist can be found in Supplementary Material 3.
Most of the studies (18/21, 85.7%) presented their imaging protocols, with three exceptions [13, 18, 29], while no study reported the use of a public protocol. Six studies (6/21, 28.6%) did not perform multiple segmentations to control the inter- or intra-rater variability of feature extraction [13, 14, 24, 29, 30, 32], and no study analyzed inter-scanner differences or temporal variability of the features. All studies that used radiomic features, and one study that used both radiomic and DL features, reduced the dimensionality of the features, whereas most DL-only studies (4/21, 19.0%) did not perform feature selection on DL features. Seven studies (33.3%) combined clinical information with radiomic models [15-18, 20, 25, 31], and nine (42.9%) compared the radiomic models with the radiologist's diagnosis or the VI-RADS category [15-17, 19, 20, 23, 28, 31, 32]. The models generally performed better than radiologists in internal validation, but their generalizability to external validation data was not as good as that of experienced radiologists. Only one study discussed the relevance of radiomic features to clinical/histological phenotypes [22]. Most studies (20/21, 95.2%) reported discrimination statistics, with one exception [29], but fewer than half reported calibration statistics (8/21, 38.1%) [15, 17-20, 25, 27, 30] or a cut-off analysis (9/21, 42.9%) [15, 17, 19, 20, 25-27, 31, 32]. Only one study (4.8%) used prospectively collected data [24]. Only one study (1/21, 4.8%) did not perform validation of the radiomic signature [12], and seven (33.3%) externally validated their models on data from other institutes [15, 17, 19, 24, 27, 30, 32]. Eight studies (38.1%) used decision curve analysis to determine the clinical utility of the models [15, 17-20, 25, 26, 30]. Finally, no study conducted cost-effectiveness analyses or shared code or representative data for model development and inference.

Fig. 3 Diagram showing reporting quality, methodological quality, and risk of bias of each study. The x-axis is the CLAIM adherence rate. The y-axis is the number of "yes" or "probably yes" answers in the PROBAST evaluation. The size of each point is the RQS points.

Fig. 4 Diagram showing reporting quality by year and sample size. The x-axis is the year of publication. The y-axis is the CLAIM adherence rate. The size of each point is the sample size.

PROBAST
The results of the PROBAST evaluation are shown in Fig. 6 and Table 5. In total, all studies were rated as having a high overall risk of bias. Models in two studies were rated as unclear ROB in the participants domain owing to poorly documented eligibility criteria [13, 29]. In the predictors domain, models in four studies were rated as high ROB [24, 26, 28, 29], and models in ten studies were rated as unclear ROB [14-16, 18, 20, 21, 23, 30-32]. In several studies the annotation was performed by more than one rater but the inter-rater variability was not analyzed, so the predictors may not have been defined in the same way for all participants. Lack of blinded annotation was the major source of high or unclear ROB: raters in two studies annotated the images with knowledge of the pathological evaluation results [24, 26], and twelve studies did not specify whether the annotation was done blindly [14-16, 18, 20, 21, 23, 28-32]. In the outcome domain, only one study was rated as low ROB [16], while the others were rated as unclear ROB; most studies (20/21) did not report how MIBC was determined in the pathological evaluation, how many pathologists were involved, or whether the assessment was blinded. In the analysis domain, all models were rated as high ROB, mainly owing to inadequate sample size and poorly considered statistical analysis. The detailed PROBAST evaluation can be found in Supplementary Material 3.

Discussion
AI techniques have been widely studied for MIBC identification. Our systematic review comprehensively evaluated the reporting quality, methodological quality, and ROB of current AI studies for MIBC prediction. The results showed that the overall quality of these studies was poor, with a median CLAIM study adherence rate of 64.1%, a median RQS of 30.6% of the maximum points, and a high ROB in all studies. The meta-analysis showed that current MIBC-predictive AI models had good performance, with an AUC of 0.85 (95% CI 0.81-0.88) for CT, 0.92 (95% CI 0.89-0.94) for MRI, 0.89 (95% CI 0.86-0.92) for radiomics, and 0.91 (95% CI 0.88-0.93) for deep learning. These results indicate that AI models have high potential for predicting MIBC but are still far from being useful tools in clinical practice.

Two systematic reviews have previously evaluated radiomic studies for MIBC prediction using RQS [33, 34]. Most of our RQS results were similar to theirs. However, a difference was observed in the "comparison to gold standard" item. In the previous reviews, most studies were assigned two points for comparing the models with the current gold standard; in our review, fewer than half of the studies were. To show the added value of radiomics, we believe that the "gold standard" should refer to the commonly used non-invasive methods for detecting MIBC in current clinical practice (i.e., manual image interpretation with or without the VI-RADS category) [4]; thus, we assigned two points only to the nine studies that had compared their models with manual interpretation. The results of these nine studies showed that the models usually performed better than radiologists in internal validation, but their generalizability to external validation data was not as good as that of experienced radiologists [15-17, 19, 20, 26, 28, 31, 32].
Using CLAIM and PROBAST, our systematic review identified several quality-reducing items specific to MIBC-predictive AI studies. First, regarding the pathological gold standard, only one study reported how MIBC was confirmed through histopathological investigation [16]. Most studies only reported the source of the specimen, and none reported the details of the histopathological investigation, including the number of pathologists, the inter-reader agreement, and the blindness of the assessment. The meta-analysis showed that differences in the pathological gold standard contributed significantly to the heterogeneity of sensitivity and specificity. About half of the studies used multiple surgical techniques to obtain the specimen; however, none reported the criteria for choosing the reference standard for individual patients or compared model performance between patients who underwent TURBT and those who underwent RC or partial cystectomy (PC). Second, statistical concerns were poorly considered in current studies. No study calculated the minimal sample size. When internally validating the models, only a few studies avoided over-pessimistic or over-optimistic evaluation of model performance by using cross-validation or bootstrapping. Many studies used cross-validation to select the best hyperparameters or the best model in the training set; however, nested cross-validation is needed to evaluate model performance while selecting optimal hyperparameters [41]. In addition, only a few studies evaluated the calibration of AI models. Third, current studies lacked analyses beyond performance evaluation: explainability was poorly discussed, especially in radiomic studies, and few studies performed a failure analysis or sensitivity analysis. Finally, no study reported the de-identification method for the clinical, pathological, and image data, and blinded assessment during the pathological and radiological evaluation was also poorly reported.
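As a hedged illustration of the nested cross-validation scheme referenced above, the sketch below uses a generic scikit-learn pipeline on synthetic data (not the pipeline of any included study): the inner loop tunes hyperparameters, while the outer loop yields an approximately unbiased estimate of performance.

```python
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic stand-in for a radiomic feature matrix and MIBC labels (hypothetical data)
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter selection; outer loop: unbiased performance estimate
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="roc_auc",
    cv=inner_cv,
)
nested_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(f"Nested CV AUC: {nested_auc.mean():.3f} ± {nested_auc.std():.3f}")
```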
This study has some limitations. First, only a subset of eligible studies met the selection criteria for the meta-analysis, and significant heterogeneity existed among studies; the meta-analysis results should therefore be interpreted with caution. Second, we only used RQS, PROBAST, and CLAIM in the evaluations. The quality of radiomics research has been a persistent concern, as it affects the repeatability and reproducibility of radiomics and limits its widespread clinical application, and various checklists and tools have been proposed for quality evaluation. Recently, the CheckList for EvaluAtion of Radiomics research (CLEAR) and the METhodological RadiomICs Score (METRICS) were proposed and are regarded as better alternatives to CLAIM and RQS for radiomic studies [42, 43]; however, our study did not include these newly developed tools. Third, CLAIM, RQS, and PROBAST contain some elements of subjective judgment, and borderline results may affect the overall interpretation.
In conclusion, although AI techniques show high diagnostic performance in predicting MIBC, the insufficient quality of the current studies suggests that these AI models are not yet ready for clinical use. Future studies could focus on more transparent reporting of the pathological evaluation, larger sample sizes, and additional analyses, such as prediction explanation, failure analysis, and sensitivity analysis.

Fig. 2
Fig. 2 Overview of study characteristics. A Aggregate number of patients included in the study; B Year of publication; C Data collection strategy; D Data source; E Internal validation method; F External validation method

Fig. 5
Fig. 5 Diagram showing methodological quality by year and sample size. The x-axis is the year of publication. The y-axis is the RQS points. The size of each point is the sample size

Fig. 6
Fig. 6 Diagram showing the risk of bias by year and sample size. The x-axis is the year of publication. The y-axis is the number of "yes" or "probably yes" answers in the PROBAST evaluation. The size of each point is the sample size

Table 1
Characteristics of included studies

Table 2
Methodological characteristics of the included studies (US, ultrasound; LASSO, least absolute shrinkage and selection operator)

Table 3
Reporting quality assessment using the CLAIM

Table 5
Risk of bias assessment using the PROBAST